1. Introduction. This paper unifies and extends important theoretical
results on empirical risk minimization and model selection. It makes
extensive and efficient use of new probability inequalities for the amount
of concentration of the (possibly symmetrized) empirical process around its
mean. The results are very subtle and very pleasing indeed, as they show
that oracle inequalities exist for very general problems.

There are in my view two aspects which need special attention. First,
the paper assumes that the loss functions $f \in \mathcal{F}$ satisfy
$|f| \le K$ for some fixed constant $K$. Let us call this the uniform bound
condition (Condition B below). Second, it is not clear how the approach used
will work in practice: the estimators depend on (unspecified) constants which
may be too large for all practical purposes, and moreover, it is difficult to
explain the method to nonspecialists. This discussion will address these two
problems.

We reformulate some of the results as a starting point for possible
extensions or alternative approaches. For transparency, we will invoke
simple, and not the most general, assumptions.

Section 2 in this discussion presents a distribution-dependent upper bound
for the excess risk, replacing the uniform bound condition by convexity
conditions and a bound on the renormalized loss functions (Condition BB).

The background of Section 3 in this discussion is the question whether 
cross-validation can be a more user-friendly model selection method than 
applying bounds in terms of Rademacher complexities. We first study why 
(data-dependent) upper and lower bounds for excess risks are useful when 
aiming at oracle behavior in model selection. We then show that when the 
margin behavior of the excess risk in each model is known, cross-validation 
can lead to oracle behavior. 
S. VAN DE GEER

Let us now first introduce our notation, following mostly that of the paper.
Assume the observations $X_1, \ldots, X_n$ are i.i.d. copies of a random
variable $X \in S$ with distribution $P$. Let $\mathcal{F}$ be a given class
of functions $f$ on $S$. The empirical risk minimizer is

\[ \hat f := \arg\min_{f \in \mathcal{F}} P_n f, \]

and its theoretical counterpart is

\[ \bar f := \arg\min_{f \in \mathcal{F}} P f. \]

We assume for simplicity that the minimizers exist. The excess risk at $f$ is
defined as $\mathcal{E}(f) := Pf - P\bar f$.

A distribution-dependent upper bound for $\mathcal{E}(\hat f)$ depends on two
ingredients, which we refer to as (1) the empirical process behavior and
(2) the margin behavior.

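Let

\[ \sigma^2(f) := Pf^2 - (Pf)^2, \]

and let

\[ \mathcal{F}_\sigma := \{ f \in \mathcal{F} : \sigma(f - \bar f) \le \sigma,\ |f - \bar f| \le 1 \}. \]

Consider the maximal increment of the empirical process

\[ Z(\sigma) := \sup_{f \in \mathcal{F}_\sigma} |P_n(f - \bar f) - P(f - \bar f)|. \]

The empirical process behavior is the behavior of $\mathbf{E}Z(\sigma)$ as a
function of $\sigma$. Bousquet's inequality [1] implies that for all
$\varepsilon > 0$,

\[ \mathbf{P}\Bigl( Z(\sigma) \ge (1+\varepsilon)\,\mathbf{E}Z(\sigma) + \sigma\sqrt{\frac{2t}{n}} + \Bigl(\frac{1}{3} + \frac{1}{\varepsilon}\Bigr)\frac{t}{n} \Bigr) \le e^{-t} \qquad \forall t > 0. \tag{1} \]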

The margin behavior of $Pf$ is the behavior of $\mathcal{E}(f)$ for
$\sigma(f - \bar f)$ small. This is described by

\[ D(\delta) = \sup\{ \sigma(f - \bar f) : f \in \mathcal{F},\ \mathcal{E}(f) \le \delta \}. \]

Condition A below (and also Conditions CC, C and {C(k)}) imposes certain
conditions on the margin behavior.

We now combine empirical process behavior and margin behavior in the
quantity $W_t(D(\delta))$, where

\[ W_t(\sigma) = \frac{8}{5}\,\mathbf{E}Z(\sigma) + \sigma\sqrt{\frac{2t}{n}} \]

is inspired by (1). We set (quite arbitrarily) the value of $\varepsilon$ at
$\varepsilon = 3/5$; that is, we do not attempt here to optimize our
constants (for simplicity).
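
The arithmetic behind this choice can be spelled out. The following is only a reading of (1) at $\varepsilon = 3/5$, assuming the last term of (1) carries the factor $(1/3 + 1/\varepsilon)$; it shows where the term $2t/n$ used later comes from.

```latex
\[
  1 + \varepsilon = \tfrac{8}{5}, \qquad
  \tfrac{1}{3} + \tfrac{1}{\varepsilon} = \tfrac{1}{3} + \tfrac{5}{3} = 2,
\]
so that (1) becomes
\[
  \mathbf{P}\Bigl( Z(\sigma) \ge W_t(\sigma) + \frac{2t}{n} \Bigr) \le e^{-t},
  \qquad
  W_t(\sigma) = \tfrac{8}{5}\,\mathbf{E}Z(\sigma) + \sigma\sqrt{\frac{2t}{n}} .
\]
```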
Lemma 1 below (and its proof) is a slight variant of the approach used
in the paper. It will be applied in Lemmas 2 and 3 to obtain
distribution-dependent bounds. In the lemma we invoke conjugates. The
conjugate of a convex nondecreasing function $G$ on $[0, \infty)$ with
$G(0) = 0$ is defined as the function $H(v) = \sup_{u \ge 0} [uv - G(u)]$.

Let us now fix some t > 0, and assume 



Condition A. There exists a strictly increasing concave upper bound
$\psi_t(\delta)$ of $W_t(D(\delta))$, satisfying

(i) $\psi_t^{-1}$ has conjugate $H_t$,

(ii) $\psi_t(\delta)/\delta$ is nonincreasing in $\delta$.

The conjugate $H_t(z)$ corresponds, roughly speaking, to a bound for the
$\sharp$-transform $U_{n,t}^{\sharp}$ defined in the paper. We use conjugates
to clarify the relation with our margin Conditions CC, C and {C(k)}.
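
As a hedged illustration of Condition A (not taken from the paper), suppose $W_t(D(\delta))$ admits the hypothetical upper bound $\psi_t(\delta) = c\sqrt{\delta}$ for some constant $c > 0$. The constant $c$ and this square-root form are assumptions made only for the example. The conjugate can then be computed in closed form:

```latex
\[
  \psi_t(\delta) = c\sqrt{\delta}
  \;\Longrightarrow\;
  \psi_t^{-1}(u) = \frac{u^2}{c^2},
  \qquad
  H_t(v) = \sup_{u \ge 0}\Bigl[\, uv - \frac{u^2}{c^2} \Bigr] = \frac{c^2 v^2}{4},
\]
with the supremum attained at $u = c^2 v / 2$.
```

In this example $\psi_t(\delta)/\delta = c/\sqrt{\delta}$ is nonincreasing, so both (i) and (ii) hold, and $\varepsilon H_t(1/\varepsilon) = c^2/(4\varepsilon)$.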

For $\delta > 0$, we let

\[ \mathcal{F}^{\delta} = \{ f \in \mathcal{F} : |f - \bar f| \le 1,\ \mathcal{E}(f) \ge \delta \}. \]

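Lemma 1. Suppose Condition A. Then we have for all $q > 1$,
$\varepsilon > 0$, $t > 0$ and $\delta > 0$,

\[ \mathbf{P}\Biggl( \sup_{f \in \mathcal{F}^{\delta}} \frac{|(P_n - P)(f - \bar f)|}{\varepsilon\bigl(H_t(1/\varepsilon) + \mathcal{E}(f)\bigr) + \frac{2t}{qn}} \ge q \Biggr) \le \log_q\Bigl(\frac{q}{\delta}\Bigr) e^{-t}. \]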



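Proof. Define $\delta_j = q^{-j}$, $j = 0, 1, \ldots$. Then for any
$\delta > 0$, and for $J$ such that $\delta_J \ge \delta > \delta_{J+1}$,

\[ \mathbf{P}\Biggl( \sup_{f \in \mathcal{F}^{\delta}} \frac{|(P_n - P)(f - \bar f)|}{\psi_t(\mathcal{E}(f)) + \frac{2t}{qn}} \ge q \Biggr) \le \sum_{j=0}^{J} \mathbf{P}\Bigl( Z(D(\delta_j)) \ge q\,\psi_t(\delta_{j+1}) + \frac{2t}{n} \Bigr) \]

\[ \le \sum_{j=0}^{J} \mathbf{P}\Bigl( Z(D(\delta_j)) \ge \psi_t(\delta_j) + \frac{2t}{n} \Bigr) \le \log_q\Bigl(\frac{q}{\delta}\Bigr) e^{-t}. \]

The result now follows, since for any $\varepsilon > 0$ and any $x > 0$,
$\psi_t(x) \le \varepsilon(H_t(1/\varepsilon) + x)$. □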
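
The final inequality of the proof is an instance of Young's inequality for a function and its conjugate. A minimal numerical sketch follows, assuming the hypothetical bound $\psi_t(x) = c\sqrt{x}$, whose inverse $u^2/c^2$ has conjugate $H_t(v) = c^2 v^2/4$. The function names and the constant `c` are illustrative and do not come from the paper.

```python
import math


def psi(x: float, c: float) -> float:
    """Hypothetical concave bound psi_t(x) = c * sqrt(x) (an assumption)."""
    return c * math.sqrt(x)


def conjugate_of_inverse(v: float, c: float) -> float:
    """Conjugate H_t(v) of psi_t^{-1}(u) = u**2 / c**2, equal to c**2 * v**2 / 4."""
    return c * c * v * v / 4.0


def young_gap(x: float, eps: float, c: float) -> float:
    """eps * (H_t(1/eps) + x) - psi_t(x); nonnegative when the inequality holds."""
    return eps * (conjugate_of_inverse(1.0 / eps, c) + x) - psi(x, c)


if __name__ == "__main__":
    c = 2.0
    for eps in (0.1, 0.3, 0.6):
        # Scan x over a grid in (0, 10] and report the smallest gap found.
        worst = min(young_gap(x / 100.0, eps, c) for x in range(1, 1001))
        print(f"eps={eps}: smallest gap on grid = {worst:.6f}")
```

In this example the gap vanishes exactly at $x = c^2/(4\varepsilon^2)$, which shows that the inequality cannot be improved for this particular $\psi_t$.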



The following lemma presents an upper bound for the excess risk. The
lemma (and its proof) is a simplified (less general) version of Theorem 1
(and its proof) in the paper. We present the lemma here in order to allow
comparison with the extension to the case where a uniform bound on the
functions in $\mathcal{F}$ is not available (see Lemma 3 below).
Assuming a uniform bound condition, a distribution-dependent bound for
the excess risk takes the form

\[ \delta_{t,n} := \frac{q\varepsilon}{1 - q\varepsilon}\, H_t\Bigl(\frac{1}{\varepsilon}\Bigr) + \frac{2t}{(1 - q\varepsilon)n}. \tag{2} \]

Condition B. We have $|f - \bar f| \le 1$ for all $f \in \mathcal{F}$.

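Lemma 2. Suppose Conditions A and B. Then for all $q > 1$,
$0 < \varepsilon < 1/q$ and $\delta \ge \delta_{t,n}$,

\[ \mathbf{P}(\mathcal{E}(\hat f) > \delta) \le \log_q\Bigl(\frac{q}{\delta}\Bigr) e^{-t}. \]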

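Proof. Let $\delta > 0$ and let $E$ be the event

\[ \sup_{f \in \mathcal{F}:\ \mathcal{E}(f) \ge \delta} \frac{|(P_n - P)(f - \bar f)|}{\varepsilon\bigl(H_t(1/\varepsilon) + \mathcal{E}(f)\bigr) + \frac{2t}{qn}} < q. \]

Since

\[ \mathcal{E}(\hat f) \le |(P_n - P)(\hat f - \bar f)|, \]

we know that on $E$,

\[ \mathcal{E}(\hat f) \le \delta_{t,n} \vee \delta. \]

Therefore, when $\delta \ge \delta_{t,n}$,

\[ \mathbf{P}(\mathcal{E}(\hat f) > \delta) \le 1 - \mathbf{P}(E) \le \log_q\Bigl(\frac{q}{\delta}\Bigr) e^{-t}. \qquad \Box \]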
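
To get a feeling for the size of the bound (2) and of the probability bound in Lemma 2, one can evaluate them numerically. The sketch below again assumes the hypothetical choice $\psi_t(\delta) = c\sqrt{\delta}$, so that $H_t(v) = c^2 v^2/4$. The helper names and the example values of `t`, `n`, `q`, `eps` and `c` are assumptions for illustration only.

```python
import math


def h_t(v: float, c: float) -> float:
    """Conjugate H_t(v) = c**2 * v**2 / 4 for the assumed psi_t(delta) = c*sqrt(delta)."""
    return c * c * v * v / 4.0


def delta_tn(t: float, n: int, q: float, eps: float, c: float) -> float:
    """Evaluate the bound (2): q*eps/(1-q*eps) * H_t(1/eps) + 2t/((1-q*eps)*n)."""
    if not (q > 1.0 and 0.0 < eps < 1.0 / q):
        raise ValueError("need q > 1 and 0 < eps < 1/q")
    shrink = 1.0 - q * eps
    return q * eps / shrink * h_t(1.0 / eps, c) + 2.0 * t / (shrink * n)


def tail_bound(delta: float, t: float, q: float) -> float:
    """Right-hand side of Lemma 2: log_q(q / delta) * exp(-t)."""
    return math.log(q / delta, q) * math.exp(-t)


if __name__ == "__main__":
    n, t, q, eps = 1000, 5.0, 2.0, 0.25
    c = 1.0 / math.sqrt(n)  # assumed scaling of the complexity term
    d = delta_tn(t, n, q, eps, c)
    print("delta_{t,n} =", d)
    print("probability bound =", tail_bound(d, t, q))
```

The logarithmic factor $\log_q(q/\delta)$ comes from the peeling argument in Lemma 1; it grows only slowly as $\delta$ decreases, so the bound is driven mainly by $e^{-t}$.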

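2. The case of possibly unbounded functions. In this section, we assume
that $\mathcal{F}$ is indexed by a parameter $\theta$ in some space $\Theta$:

\[ \mathcal{F} = \{ \gamma_\theta : \theta \in \Theta \}. \]

We moreover assume that $\Theta$ is a convex subset of a normed vector space
with norm $\tau$, and that $\theta \mapsto \gamma_\theta(x)$ is convex for
all $x \in S$. We let $\hat f = \gamma_{\hat\theta}$ and
$\bar f = \gamma_{\bar\theta}$.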

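Condition BB. Suppose that for some $\eta_n > 0$,

\[ \eta_n |\gamma_\theta - \gamma_{\bar\theta}| \le \tau(\theta - \bar\theta) \vee \eta_n. \]

We also need a margin condition. Let
$\Theta_1 := \{\theta \in \Theta : |\gamma_\theta - \gamma_{\bar\theta}| \le 1\}$.

Condition CC. For some increasing function $\mathcal{D}$,

\[ \mathcal{D}(\mathcal{E}(\gamma_\theta)) \ge \tau(\theta - \bar\theta) \qquad \forall \theta \in \Theta_1. \]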
Define now

\[ \tau_n := \mathcal{D}(\delta_{t,n}), \]

where $\delta_{t,n}$ is given in (2).



Lemma 3. Let $q > 1$ and $0 < \varepsilon < 1/q$ be arbitrary. Assume
Conditions A, BB and CC, and that $\tau_n \le \eta_n/2$. Then we have

\[ \mathbf{P}(\mathcal{E}(\hat f) > \delta_{t,n}) \le \log_q\Bigl(\frac{q}{\delta_{t,n}}\Bigr) e^{-t}. \]



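Proof. Define

\[ \tilde\theta := \alpha\hat\theta + (1 - \alpha)\bar\theta, \]

with

\[ \alpha := \frac{2\tau_n}{2\tau_n + \tau(\hat\theta - \bar\theta)}. \]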

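Then

\[ \frac{2\tau_n\,|\gamma_{\hat\theta} - \gamma_{\bar\theta}|}{2\tau_n + \tau(\hat\theta - \bar\theta)} \le \frac{2\tau_n\,|\gamma_{\hat\theta} - \gamma_{\bar\theta}|}{\tau(\hat\theta - \bar\theta)} \le \cdots \]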

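Moreover, by the convexity of $\theta \mapsto P_n(\gamma_\theta)$, for
$\tilde f := \gamma_{\tilde\theta}$,

\[ P_n(\tilde f) \le \alpha P_n(\hat f) + (1 - \alpha) P_n(\bar f) \le P_n(\bar f). \]

This implies

\[ \mathcal{E}(\tilde f) \le |(P_n - P)(\tilde f - \bar f)|. \]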

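Let $E_n$ be the event

\[ \sup_{f \in \mathcal{F}^{\delta_{t,n}}} \frac{|(P_n - P)(f - \bar f)|}{\varepsilon\bigl(H_t(1/\varepsilon) + \mathcal{E}(f)\bigr) + \frac{2t}{qn}} < q. \]

By the same arguments as in Lemma 2, we have on $E_n$

\[ \mathcal{E}(\tilde f) \le \delta_{t,n}. \]

But then

\[ \tau(\tilde\theta - \bar\theta) \le \mathcal{D}(\delta_{t,n}) = \tau_n. \]

Hence

\[ \tau(\hat\theta - \bar\theta) \le 2\tau_n. \]

But then also

\[ |\gamma_{\hat\theta} - \gamma_{\bar\theta}| \le 1. \]

So on $E_n$ also

\[ \mathcal{E}(\hat f) \le \delta_{t,n}. \qquad \Box \]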
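
The passage from the bound on the intermediate point to the bound on $\hat\theta$ can be made explicit. The derivation below is a sketch that assumes the intermediate point is the convex combination $\tilde\theta = \alpha\hat\theta + (1-\alpha)\bar\theta$ with $\alpha = 2\tau_n/(2\tau_n + \tau(\hat\theta - \bar\theta))$ and $\tau_n > 0$; the symbols $\tilde\theta$ and $\alpha$ are the notation adopted here.

```latex
\[
  \tau(\tilde\theta - \bar\theta)
  = \alpha\,\tau(\hat\theta - \bar\theta)
  = \frac{2\tau_n\,\tau(\hat\theta - \bar\theta)}{2\tau_n + \tau(\hat\theta - \bar\theta)}
  \le \tau_n
  \;\Longleftrightarrow\;
  2\,\tau(\hat\theta - \bar\theta) \le 2\tau_n + \tau(\hat\theta - \bar\theta)
  \;\Longleftrightarrow\;
  \tau(\hat\theta - \bar\theta) \le 2\tau_n .
\]
```

Since $2\tau_n \le \eta_n$ by assumption, Condition BB then gives $|\gamma_{\hat\theta} - \gamma_{\bar\theta}| \le 1$, which is what allows the bounded-case argument to be applied to $\hat f$ itself.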
3. Model selection. Consider now a set of models $\{\mathcal{F}_k\}$, with
$\mathcal{F}_k \subset \mathcal{F}$ for all $k$. Let

\[ f_* := \arg\min_{f \in \mathcal{F}} Pf, \qquad \bar f_k := \arg\min_{f \in \mathcal{F}_k} Pf, \]

and denote the empirical risk minimizer in model $k$ by

\[ \hat f_k := \arg\min_{f \in \mathcal{F}_k} P_n f. \]

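We moreover define the excess risk at $\hat f_k$ within the model $k$ as

\[ \mathcal{E}_k := P(\hat f_k - \bar f_k) \]

and the "empirical" excess risk at $\bar f_k$,

\[ \hat{\mathcal{E}}_k := P_n(\bar f_k - \hat f_k). \]

The overall excess risk at $f$ is

\[ \mathcal{E}_*(f) := P(f - f_*). \]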

Let $\pi(k)$ be some (data-dependent) penalty, and

\[ \hat k := \arg\min_k \{ P_n \hat f_k + \pi(k) \}, \]

assuming for simplicity that the minimum exists. It is important to find
good estimates of the ("empirical") excess risks, because we can use these
in the construction of a penalty $\pi$. To clarify why, we reformulate
Lemma 4 in the paper, combined with part of the proof of its Theorem 6. We
also impose its margin condition (5.3), which we refer to as Condition C.

Condition C. We have

\[ \mathcal{E}_*(f) \ge \phi[\sigma(f - f_*)] \qquad \forall f \in \mathcal{F}, \]

where $\phi$ is a function with conjugate $\phi^*$.

Lemma 4. Assume Conditions B and C. Let $0 < \varepsilon < 1$ be arbitrary
and let $\{t_k\}$ be an arbitrary positive sequence. Define for all $k$,

Then 

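\[ \mathbf{P}\Bigl( \mathcal{E}_*(\hat f_{\hat k}) > \frac{1}{(1-\varepsilon)^2} \min_k \bigl\{ \mathcal{E}_*(\bar f_k) + (1 - \varepsilon)[\alpha(k) + \pi(k)] \bigr\} \Bigr) \]

\[ \le \sum_k e^{-t_k} + \mathbf{P}\bigl( \exists k : \pi(k) < \hat{\mathcal{E}}_k + (1 - \varepsilon)\mathcal{E}_k + \alpha(k) \bigr). \]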
Proof. By Bernstein's inequality, with probability at least $1 - e^{-t_k}$,

\[ |(P_n - P)(\bar f_k - f_*)| \le \sqrt{\frac{2t_k}{n}}\,\sigma(\bar f_k - f_*) + \frac{t_k}{n} \le \varepsilon\phi[\sigma(\bar f_k - f_*)] + \alpha(k) \le \varepsilon\mathcal{E}_*(\bar f_k) + \alpha(k). \]

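But then

\[ (1 - \varepsilon)\mathcal{E}_*(\bar f_k) \le P_n(\bar f_k - f_*) + \alpha(k) \]

and

\[ P_n(\bar f_k - f_*) \le (1 + \varepsilon)\mathcal{E}_*(\bar f_k) + \alpha(k) \le \frac{1}{1-\varepsilon}\{\mathcal{E}_*(\bar f_k) + (1 - \varepsilon)\alpha(k)\}. \]

Let $E$ be the set where it holds for all $k$ that

\[ \mathcal{E}_*(\bar f_k) \le \frac{1}{1-\varepsilon}\{P_n(\bar f_k - f_*) + \alpha(k)\}, \qquad P_n(\bar f_k - f_*) \le \frac{1}{1-\varepsilon}\{\mathcal{E}_*(\bar f_k) + (1 - \varepsilon)\alpha(k)\} \]

and

\[ \pi(k) \ge \hat{\mathcal{E}}_k + (1 - \varepsilon)\mathcal{E}_k + \alpha(k). \]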

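We have on $E$,

\[ P_n(\hat f_{\hat k} - f_*) + \pi(\hat k) = \min_k\{P_n(\hat f_k - f_*) + \pi(k)\} \le \min_k\{P_n(\bar f_k - f_*) + \pi(k)\} \]

\[ \le \frac{1}{1-\varepsilon}\min_k\{\mathcal{E}_*(\bar f_k) + (1 - \varepsilon)[\alpha(k) + \pi(k)]\}. \]

We also have on $E$,

\[ \mathcal{E}_*(\bar f_{\hat k}) \le \frac{1}{1-\varepsilon}\{P_n(\bar f_{\hat k} - f_*) + \alpha(\hat k)\} = \frac{1}{1-\varepsilon}\{P_n(\hat f_{\hat k} - f_*) + \hat{\mathcal{E}}_{\hat k} + \alpha(\hat k)\}. \]

Hence, on $E$, using $\mathcal{E}_*(\hat f_{\hat k}) = \mathcal{E}_*(\bar f_{\hat k}) + \mathcal{E}_{\hat k}$ and the lower bound on $\pi(\hat k)$,

\[ \mathcal{E}_*(\hat f_{\hat k}) \le \frac{1}{1-\varepsilon}\{P_n(\hat f_{\hat k} - f_*) + \pi(\hat k)\} \le \frac{1}{(1-\varepsilon)^2}\min_k\{\mathcal{E}_*(\bar f_k) + (1 - \varepsilon)[\alpha(k) + \pi(k)]\}. \qquad \Box \]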

The above lemma indicates that one needs bounds for the ("empirical")
excess risk, as well as knowledge of the margin behavior, that is, of the
function $\phi$. This is also the message of the paper, and it is the reason
why it studies such bounds. Theorem 2 in the paper handles the empirical
excess risk $\hat{\mathcal{E}}_k$. Its Theorem 3 shows that one can estimate
the distribution-dependent upper bounds for the excess risk $\mathcal{E}_k$.
The latter is done using Rademacher complexities, which are based on
symmetrized versions of the empirical process.

Symmetrization inequalities are based on comparing $P_n$ with an independent
copy $P_n' = \frac{1}{n}\sum_{i=1}^n \delta_{X_i'}$, where
$\{X_1', \ldots, X_n'\}$ is a second sample independent of
$\{X_1, \ldots, X_n\}$. The question arises whether simple data splitting
can also be used to estimate $\mathcal{E}_k$ and $\hat{\mathcal{E}}_k$.
Suppose indeed we have observed $\{X_1', \ldots, X_n'\}$ in addition to
$\{X_1, \ldots, X_n\}$. We let

\[ \hat f_k' = \arg\min_{f \in \mathcal{F}_k} P_n' f. \]

Moreover, we define

\[ \mathcal{E}_k' = P(\hat f_k' - \bar f_k), \qquad \hat{\mathcal{E}}_k' = P_n'(\bar f_k - \hat f_k'). \]
We now assume the following margin condition: 

Condition {C(k)}. For all $k$,

\[ P(f - \bar f_k) \ge \phi_k[\sigma(f - \bar f_k)] \qquad \forall f \in \mathcal{F}_k, \]

where $\phi_k$ has conjugate $\phi_k^*$.

Define now for all $k$ the (truly) empirical quantities

\[ \beta(k) := (P_n' - P_n)(\hat f_k - \hat f_k'). \]
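
Because $\beta(k)$ depends only on the two samples, it can be computed directly. The following is a minimal sketch for a finite model, assuming each element of the model is given as a loss function on the sample space. The helpers `empirical_risk`, `erm` and `beta`, as well as the truncated squared loss used in the demonstration, are hypothetical and serve only as an illustration.

```python
import random
from statistics import fmean


def empirical_risk(loss, sample):
    """P_n f: average of the loss function `loss` over `sample`."""
    return fmean([loss(x) for x in sample])


def erm(model, sample):
    """Empirical risk minimizer over a finite model (a list of loss functions)."""
    return min(model, key=lambda loss: empirical_risk(loss, sample))


def beta(model, sample, sample_prime):
    """beta(k) = (P'_n - P_n)(fhat_k - fhat'_k) for one finite model."""
    f_hat = erm(model, sample)              # minimizer on the first sample
    f_hat_prime = erm(model, sample_prime)  # minimizer on the second sample
    diff_prime = (empirical_risk(f_hat, sample_prime)
                  - empirical_risk(f_hat_prime, sample_prime))
    diff = empirical_risk(f_hat, sample) - empirical_risk(f_hat_prime, sample)
    return diff_prime - diff


def truncated_squared_loss(center):
    """Hypothetical loss f(x) = min((x - center)^2, 1), bounded by 1."""
    return lambda x: min((x - center) ** 2, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    first = [rng.gauss(0.3, 0.2) for _ in range(200)]
    second = [rng.gauss(0.3, 0.2) for _ in range(200)]
    model = [truncated_squared_loss(c / 10) for c in range(11)]
    print("beta(k) =", beta(model, first, second))
```

Note that $\beta(k)$ is always nonnegative: $\hat f_k'$ minimizes $P_n'$ and $\hat f_k$ minimizes $P_n$ over the same model, so both contributions to the difference have the right sign.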

Lemma 5. Assume Conditions B and {C(k)}. Let $0 < \varepsilon < 1$ be
arbitrary and let $\{t_k\}$ be an arbitrary positive sequence. Define

Then with probability at least $1 - \sum_k e^{-t_k}$, we have for all $k$

\[ \beta(k) + 2\gamma(k) \ge (1 - \varepsilon)\{\mathcal{E}_k' + \mathcal{E}_k\} + \hat{\mathcal{E}}_k' + \hat{\mathcal{E}}_k. \]

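Proof. By Bernstein's inequality, conditionally on $X_1, \ldots, X_n$, we
have with probability at least $1 - \tfrac{1}{2}e^{-t_k}$, that

\[ (P - P_n')(\hat f_k - \bar f_k) \le \sqrt{\frac{2t_k}{n}}\,\sigma(\hat f_k - \bar f_k) + \frac{t_k}{n}. \]

But then,

\[ (P - P_n')(\hat f_k - \bar f_k) \le \varepsilon\mathcal{E}_k + \gamma(k) \quad \text{or} \quad P_n'(\hat f_k - \bar f_k) \ge (1 - \varepsilon)\mathcal{E}_k - \gamma(k). \]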



DISCUSSION OF LOCAL RADEMACHER COMPLEXITIES 






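Similarly for $P_n(\hat f_k' - \bar f_k)$.

Let $E$ be the set where it holds that for all $k$

\[ P_n'(\hat f_k - \bar f_k) \ge (1 - \varepsilon)\mathcal{E}_k - \gamma(k) \]

and

\[ P_n(\hat f_k' - \bar f_k) \ge (1 - \varepsilon)\mathcal{E}_k' - \gamma(k). \]

On $E$,

\[ (P_n' - P_n)(\hat f_k - \hat f_k') = (P_n' - P_n)(\hat f_k - \bar f_k) + (P_n - P_n')(\hat f_k' - \bar f_k) \]

\[ = P_n'(\hat f_k - \bar f_k) + \hat{\mathcal{E}}_k + P_n(\hat f_k' - \bar f_k) + \hat{\mathcal{E}}_k' \]

\[ \ge (1 - \varepsilon)\{\mathcal{E}_k + \mathcal{E}_k'\} + \hat{\mathcal{E}}_k + \hat{\mathcal{E}}_k' - 2\gamma(k). \qquad \Box \]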



It follows that if the margin behavior of all models $\{\mathcal{F}_k\}$ and
$\mathcal{F}$ is known, one may take as data-dependent penalty

\[ \pi(k) = \beta(k) + \alpha(k) + 2\gamma(k). \tag{3} \]

One can then apply Lemma 4. One may proceed by proving a
distribution-dependent upper bound for this choice of $\pi(k)$ (this bound
actually follows from the paper). The penalty clearly has the advantage that
it is simple to implement. But as it requires the margin behavior to be
known, there are many problems (e.g., classification) where it cannot be
used. In the paper, Theorems 5 and 6, only the margin behavior of the model
$\mathcal{F}$ is assumed to be known. Thus, one might say that by estimating
the upper bounds for the excess risks (using Rademacher complexities),
instead of the excess risks themselves, one overcomes the problem of not
knowing the margin behavior of all models $\mathcal{F}_k$.

The paper moreover shows that by replacing the penalization method by
a comparison method, the margin problem can be solved. Another promising
approach is in my view the use of $\ell_1$-type penalties, but these are only
defined within the context of linear parameter spaces.